Evaluating the Significance of Sequence
Authors
Abstract
Sdiscover is a tool capable of finding subsequences, possibly separated by arbitrarily long gaps, in a set of sequences. These subsequences are referred to as motifs. This paper proposes a method to evaluate the significance of the sequence motifs found by Sdiscover. The method is based on the minimum description length principle and Shannon's coding theory. The equivalence of the proposed method to Bayesian inference is also discussed.

1 Introduction

As the Human Genome Project [4] is expected to be completed in a few years, the research focus has shifted from sequencing biological data to mining and interpreting these data [1, 7, 9]. The interesting patterns to be mined range from genes [3], to DNA or protein sequence motifs [2, 10], to protein and RNA structure motifs [5, 9]. In this paper, we consider the problem of evaluating the significance of sequence motifs found by our pattern matching tool, Sdiscover [10]. Given a set of sequences, the motifs of interest are in the regular ex
Similar resources
Compression, Significance and Accuracy
Stephen Muggleton The Turing Institute, 36 North Hanover Street, Glasgow G1 2AD, UK Ashwin Srinivasan The Turing Institute, 36 North Hanover Street, Glasgow G1 2AD, UK Michael Bain The Turing Institute, 36 North Hanover Street, Glasgow G1 2AD, UK Abstract Inductive Logic Programming (ILP) involves learning relational concepts from examples and background knowledge. To date all ILP learning syst...
Systematic and Fully Automated Identification of Protein Sequence Patterns
We present an efficient algorithm to systematically and automatically identify patterns in protein sequence families. The procedure is based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automated discovery of patterns in 974 PROSITE families (the complete subset of PROSI...
Motion Differential SPIHT for Image Sequence Coding
Efficient image sequence coding exploits both intra- and inter-frame correlations. SPIHT is efficient in intra-frame decorrelation for still images. Based on SPIHT, differential-SPIHT removes inter-frame redundancy by reusing the significance map of a SPIHT coded frame. The motion differential SPIHT (MD-SPIHT) automatically decides the coding methods for each frame, according to the inter-frame corre...
FBST Regularization and Model Selection
We show how the Full Bayesian Significance Test (FBST) can be used as a model selection criterion. The FBST was presented by Pereira and Stern [3842] as a coherent Bayesian significance test.
On the statistical significance of temporal firing patterns in multi-neuronal spike trains
Repeated occurrences of serial firing sequences of a group of neurons with fixed time delays between neurons are observed in many experiments involving simultaneous recordings from multiple neurons. Such temporal patterns are potentially indicative of underlying microcircuits and it is important to know when a repeatedly occurring pattern is statistically significant. These sequences are typically i...